
Alternate fix for conversations that stop responding due to stale execution status #2470

Open
DoubleDensity wants to merge 5 commits into OpenHands:main from DoubleDensity:stuck_conversation_fix_alternate

Conversation

@DoubleDensity

… RUNNING reset already happens outside the loop at lines 586-592, so we shouldn't check for STUCK again inside the loop and break.

Summary

[fill in a summary of this PR]

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?

@DoubleDensity
Author

also did not fix the issue after further testing -- please do not merge yet

@DoubleDensity
Author

this one looks promising, was able to resume a previously hung base_state.json at least. will continue testing to see if any stuck sessions occur again

@DoubleDensity
Author

@OpenHands /codereview-roasted

@openhands-ai

openhands-ai bot commented Mar 17, 2026

I'm on it! DoubleDensity can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Mar 17, 2026

Summary

I analyzed PR #2470 "Alternate fix for conversations that stop responding due to stale execution status" and provided a code review using the /codereview-roasted persona.

What I did:

  1. Retrieved PR details and diff using GitHub CLI
  2. Analyzed the code changes in local_conversation.py
  3. Provided a code review with:
    • Verdict: ✅ Worth merging - the core fix (changing continue to break when stuck is detected) correctly addresses the infinite loop bug
    • Key finding: The original bug was that when stuck was detected, the code set status to STUCK and used continue, which would loop back. With STUCK removed from the break condition, it would keep looping infinitely. The fix correctly breaks out of the loop.
    • Minor improvement suggestions around documentation clarity and potential extraction of stuck detection logic into a helper method

I did not modify any code - the task was to provide a code review comment as requested by the /codereview-roasted command on the PR.

@xingyaoww
Collaborator

@OpenHands /codereview-roasted

@openhands-ai

openhands-ai bot commented Mar 18, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Collaborator

🔴 Needs improvement — Patching symptoms, not the disease

Linus-Style Analysis

[CRITICAL ISSUES]

1. The author already told us it doesn't work

"also did not fix the issue after further testing -- please do not merge yet"

Let me repeat that: the person who wrote this PR says it does not fix the problem. Why are we reviewing code that the author has disowned? This should be closed or converted to a draft, not reviewed.

2. [local_conversation.py, Lines 637-656] last_event_count tracking is a band-aid on a band-aid

The original code's continue → STUCK check → break two-step dance is admittedly ugly. Replacing it with a direct break is fine — that's the one good idea here. But then you bolt on last_event_count tracking in three separate places (line 595, ~654, ~671) to prevent is_stuck() from being called "too often." This is solving an imaginary problem while the real one walks past you waving a flag.

The is_stuck() detector already has its own thresholds and only scans the last 20 events. Calling it once per loop iteration is cheap. The real question you should be asking is: why does the conversation stop responding after stuck detection? Adding event-count guards doesn't answer that.

3. [local_conversation.py, Lines 646-656] The break creates a new problem

The original flow was: detect stuck → set STUCK → continue → next iteration checks STUCK → break. Your fix: detect stuck → set STUCK → break. Same outcome, cleaner path — good.

But here's what you missed: after run() exits via break, nothing resets the status to IDLE. Look at the flow:

  1. Stuck detected → status = STUCK → break out of loop
  2. run() returns
  3. User sends new message via send_message() → sees STUCK → resets to IDLE ✓
  4. User calls run() again → sees STUCK at line 586-592 → resets to RUNNING ✓

OK, that part is actually fine. But the original code had the same path (continue → break), so your change to break is functionally identical to the original for stuck detection. The only behavioral difference is the last_event_count guard, which skips stuck detection when no new events appear. If the agent is spinning without producing events (e.g., an empty step), you'll never detect stuck. That's a regression.
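The skipped-detection regression can be sketched with a toy loop (hypothetical names, not the actual local_conversation.py code): an agent that spins without producing events never gets past the event-count guard, so the stuck detector is never consulted.

```python
def run_loop(step, is_stuck, max_iters=10):
    """Toy run loop: `step` returns the number of new events per iteration,
    `is_stuck` is the detector. Mirrors the guarded-check shape only."""
    events = 0
    last_event_count = 0
    checks = 0
    for _ in range(max_iters):
        if events != last_event_count:  # the guard under discussion
            checks += 1
            last_event_count = events
            if is_stuck():
                return "stuck", checks
        events += step()
    return "done", checks

# Agent spins without emitting events: the detector is never even called.
status, checks = run_loop(step=lambda: 0, is_stuck=lambda: True)
```

Here `status` comes back "done" with zero detector calls, even though the detector would have reported stuck on its first invocation.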

4. [local_conversation.py, Line 595 / Line 654 / Line 671] Three-way state tracking is a complexity smell

last_event_count is updated at initialization, after stuck check, and after agent.step(). That's three places maintaining the same bookkeeping. When the next person touches this loop, they'll need to reason about all three to understand when stuck detection fires. This violates the "if you need more than 3 levels of indentation, you're screwed" principle — except here it's "if you need a mental flowchart to track one variable, you're screwed."

[IMPROVEMENT OPPORTUNITIES]

5. [PR description] "[fill in a summary of this PR]" — Really?

The summary is literally a placeholder. This tells me the PR was submitted in a rush. If you can't explain what your change does in one paragraph, you probably don't fully understand what it does yet.

[TESTING GAPS]

6. Zero tests for a change to the core conversation loop

This modifies the central run loop — the beating heart of conversation execution. The stuck detection path, the break/continue behavior, the interaction between run() and send_message() across threads — none of it is covered by a new test. For a bug fix that the author says "did not fix the issue," the absence of a reproducer test is damning. How do you know when you've actually fixed it if you can't reproduce it?

7. [PR description] No evidence section

No reproduction steps, no logs showing the stuck behavior, no before/after comparison. We're flying blind.


VERDICT:
Needs rework — The continue → break simplification is the right instinct, but everything else is misguided complexity. The last_event_count optimization solves a non-problem while introducing a potential regression (skipping stuck detection when step produces no events). Most critically, the author confirms this doesn't fix the actual issue.

KEY INSIGHT:
The real bug is likely a concurrency issue between run() and send_message() around the state lock, not the continue vs break choice in stuck detection. Investigate what happens to the state lock when send_message() fires between stuck detection and the loop restart — that's where conversations go to die.

@openhands-ai

openhands-ai bot commented Mar 18, 2026

Summary

I completed the /codereview-roasted request on PR #2470 by:

  1. ✅ Retrieved PR context — Fetched the diff, PR metadata, and identified it's from a fork by DoubleDensity.
  2. ✅ Analyzed the full code context — Read the local_conversation.py run loop, the StuckDetector implementation, send_message() interaction, and post-loop code to understand the complete picture.
  3. ✅ Posted a Linus Torvalds-style roasted code review as a PR comment, covering:
    • The author's own admission that the fix doesn't work
    • The last_event_count tracking adding unnecessary complexity (3 update sites) for a non-problem
    • The continue → break change being functionally identical to the original flow
    • A potential regression where stuck detection is skipped when no new events appear
    • Missing tests, incomplete PR description, and no evidence section
    • A hypothesis that the real bug is a concurrency issue around the state lock, not the continue/break choice

No code changes were made — this was a review-only task, which is correct for a /codereview-roasted request. The review was posted at: #2470 (comment)

@xingyaoww
Collaborator

@OpenHands based on this #2470 (comment) can you do an investigation on the actual cause of this issue?

@openhands-ai

openhands-ai bot commented Mar 18, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Collaborator

Investigation: Root Cause Analysis for "Conversations That Stop Responding"

After a thorough investigation of the codebase, I analyzed the run loop, send_message(), StuckDetector, FIFOLock, the ConversationState autosave mechanism, and the agent-server's EventService layer. Here are my findings:


TL;DR

The PR's changes to stuck detection (continue → break, event count tracking) are addressing symptoms, not root causes. The stuck detection flow (continue → reacquire lock → check STUCK → break) is functionally equivalent to a direct break. The real issues lie in several concurrency and design problems across the SDK and agent-server layers.


Finding 1: TOCTOU Race in event_service.send_message() (Agent-Server)

File: openhands-agent-server/openhands/agent_server/event_service.py, lines 308-331

async def send_message(self, message: Message, run: bool = False):
    # Step A: Add message (blocking executor call)
    await loop.run_in_executor(None, self._conversation.send_message, message)
    # Step B: Check status (acquires lock briefly)
    if run:
        with self._conversation.state as state:
            run = state.execution_status != ConversationExecutionStatus.RUNNING
    # Step C: Start run if needed
    if run:
        loop.create_task(_run_with_error_handling())

There is a time-of-check-to-time-of-use gap between Step A completing and Step B checking status. During this gap:

  • The existing run loop could have already picked up AND processed the new message
  • The run loop could have exited (status changed from RUNNING → FINISHED)
  • Step B then sees a non-RUNNING status and starts a redundant run via Step C

While this typically results in harmless no-op runs (the new run() sees FINISHED and immediately breaks), it represents fragile logic that could cause unexpected behavior under specific timing conditions.

Finding 2: Untracked Fire-and-Forget Run Tasks (Agent-Server)

File: event_service.py, lines 325-331

# send_message creates untracked tasks:
loop.create_task(_run_with_error_handling())  # NOT stored in self._run_task

# But run() checks _run_task to prevent double-running:
if self._run_task is not None and not self._run_task.done():
    raise ValueError("conversation_already_running")

send_message() creates fire-and-forget run tasks that are NOT stored in _run_task. This means event_service.run() cannot detect runs started by send_message(), potentially leading to two concurrent conversation.run() calls. The FIFO lock serializes them, so they don't truly run in parallel, but the second run may encounter unexpected state after the first completes.
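A minimal sketch of that suggestion, assuming a single tracked slot for the current run task (illustrative names; the real event_service API may differ):

```python
import asyncio

class RunTracker:
    """Stores the current run task so any entry point can detect an in-flight run."""

    def __init__(self) -> None:
        self._run_task: asyncio.Task | None = None

    def start(self, coro) -> asyncio.Task:
        if self._run_task is not None and not self._run_task.done():
            coro.close()  # avoid an un-awaited coroutine warning
            raise ValueError("conversation_already_running")
        self._run_task = asyncio.get_running_loop().create_task(coro)
        return self._run_task

async def demo() -> bool:
    tracker = RunTracker()

    async def run() -> None:
        await asyncio.sleep(0.01)

    first = tracker.start(run())
    try:
        tracker.start(run())  # second concurrent start is rejected
        rejected = False
    except ValueError:
        rejected = True
    await first
    return rejected

print(asyncio.run(demo()))
```

With both send_message() and run() funneling through one tracker, neither path can silently spawn a second concurrent run.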

Finding 3: Exception Handler Outside Lock (SDK)

File: local_conversation.py, lines 690-705

        except Exception as e:
            self._state.execution_status = ConversationExecutionStatus.ERROR  # ← No lock!
            self._on_event(ConversationErrorEvent(...))

The exception handler in run() sets execution_status = ERROR outside any with self._state: block. If another thread (e.g., send_message()) is concurrently modifying state, this creates a race condition on the status field. The autosave in __setattr__ would persist whichever write happens last, potentially losing the ERROR status.

Finding 4: Non-Atomic File Writes (SDK)

File: io/local.py, lines 62-68

def write(self, path: str, contents: str | bytes) -> None:
    with open(full_path, "w", encoding="utf-8") as f:
        f.write(contents)

_save_base_state() writes base_state.json non-atomically. If the process crashes mid-write (which is plausible during long-running conversations), the file could be partially written, leading to corrupted JSON that fails to load on restart. This would manifest as a load error rather than "stale status," but is a robustness issue.
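The standard remedy, also suggested in the recommendations below, is a temp-file-plus-rename pattern. A generic sketch (not the actual io/local.py code):

```python
import os
import tempfile

def atomic_write(path: str, contents: str) -> None:
    """Write contents atomically: temp file in the same directory, fsync,
    then os.replace(), which is atomic on POSIX and Windows."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # readers see either the old or the new file
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The temp file must live in the same directory as the target, since os.replace() is only atomic within a single filesystem.
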

Finding 5: Status Check Blocks the Event Loop (Agent-Server)

File: event_service.py, lines 313-315

if run:
    with self._conversation.state as state:  # ← Sync lock on async thread!
        run = state.execution_status != ConversationExecutionStatus.RUNNING

This acquires the synchronous FIFO lock directly on the asyncio event loop thread. If the run loop is holding the lock during agent.step() (which includes LLM API calls potentially taking 10-30+ seconds), the entire event loop blocks. This means:

  • No WebSocket messages can be processed
  • No HTTP responses can be sent
  • No other coroutines can execute

This is likely the most impactful contributor to the "stop responding" perception — the server literally cannot respond to any requests while the lock is held during agent.step().
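A hedged mitigation sketch, assuming the lock must remain synchronous: hop to a worker thread for the blocking acquisition so the event loop keeps serving traffic (illustrative names, not the real FIFOLock API):

```python
import asyncio
import threading

state_lock = threading.Lock()   # stand-in for the FIFO state lock
execution_status = "finished"   # stand-in for state.execution_status

def needs_new_run() -> bool:
    # May block for many seconds if agent.step() currently holds the lock.
    with state_lock:
        return execution_status != "running"

async def send_message_check() -> bool:
    loop = asyncio.get_running_loop()
    # Offloading keeps the event loop free to serve WebSocket/HTTP requests.
    return await loop.run_in_executor(None, needs_new_run)

print(asyncio.run(send_message_check()))
```

Only the worker thread waits on the lock; coroutines scheduled on the loop continue to run in the meantime.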

Finding 6: Related Issues - False Positive Stuck Detection

Issues #2482 and #1349 describe scenarios where:

  • Extended thinking models produce reasoning-only responses → emitted as MessageEvent → 3 consecutive agent MessageEvents triggers the monologue stuck detector → status becomes STUCK
  • LLM text responses without tool calls set FINISHED prematurely

These could compound with the above concurrency issues to create the "stop responding" behavior.
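For illustration, the monologue heuristic those issues describe reduces to something like the following toy check (not the actual StuckDetector implementation):

```python
def looks_like_monologue(recent_kinds: list[str], threshold: int = 3) -> bool:
    """Flag when the last `threshold` events are all agent messages with no
    interleaved actions. Reasoning-only responses from extended-thinking
    models surface as plain agent messages, so they can trip this check."""
    tail = recent_kinds[-threshold:]
    return len(tail) == threshold and all(k == "agent_message" for k in tail)

# Three reasoning-only turns in a row look like a monologue:
print(looks_like_monologue(["agent_message", "agent_message", "agent_message"]))
# An interleaved action breaks the pattern:
print(looks_like_monologue(["agent_message", "action", "agent_message"]))
```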


Recommendations (Priority Order)

  1. Move the status check in event_service.send_message() to a thread (Finding 5) — or better, make the run decision inside conversation.send_message() itself while the lock is already held, avoiding the TOCTOU gap entirely (Findings 1 & 5).

  2. Track all run tasks (Finding 2) — send_message() should store its run task in _run_task (or a shared tracking mechanism) so run() can detect and prevent double-running.

  3. Move the exception handler inside the lock (Finding 3):

    except Exception as e:
        with self._state:
            self._state.execution_status = ConversationExecutionStatus.ERROR
  4. Use atomic writes for base_state.json (Finding 4) — write to a temp file then rename (which is atomic on most filesystems).

  5. Address #2482 ([Bug] Remote conversation got stuck: monologue detector false positive on extended thinking models) — Fix the monologue false positive, which is a contributor to unnecessary STUCK states.
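Recommendation 1 can be sketched as follows, with the run decision made while the state lock is still held so the check and the act are one atomic step (hypothetical names; the real conversation API differs):

```python
import threading

class ConversationSketch:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.execution_status = "idle"
        self.events: list[str] = []

    def send_message(self, message: str) -> bool:
        """Append the message and decide, under the same lock, whether the
        caller should start a new run. No check-then-act gap remains."""
        with self._lock:
            self.events.append(message)
            return self.execution_status != "running"

conv = ConversationSketch()
print(conv.send_message("hello"))  # no run in flight, caller should start one
conv.execution_status = "running"
print(conv.send_message("again"))  # run already active
```

Because the status is read in the same critical section that appends the message, the run loop cannot exit between the two steps.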


Why This PR Doesn't Fix the Issue

The PR correctly identifies that continue after stuck detection creates an unnecessary round-trip through the loop. But changing continue to break is functionally equivalent (the original continue → reacquire lock → STUCK check → break path produces the same result). The last_event_count tracking adds complexity without addressing the real problems listed above.

The root cause of "conversations that stop responding" is most likely Finding 5 (synchronous lock blocking the event loop), possibly combined with Finding 2 (untracked run tasks) and Finding 6 (false positive stuck detection). These are architectural issues in the agent-server layer, not in the stuck detection logic.

@openhands-ai

openhands-ai bot commented Mar 18, 2026

Summary

I was asked to investigate the actual root cause of the "conversations that stop responding due to stale execution status" issue referenced in PR #2470.

What I did:

Key findings — 6 root causes identified:

  1. TOCTOU race in event_service.send_message() — Gap between adding message and checking whether to auto-run
  2. Untracked fire-and-forget run tasks — send_message() creates tasks not tracked by _run_task, allowing potential double-running
  3. Exception handler outside lock — run()'s except clause sets ERROR without the FIFO lock
  4. Non-atomic file writes — base_state.json could be corrupted on crash
  5. Synchronous lock blocking the event loop (most impactful) — event_service.send_message() acquires the sync FIFO lock on the asyncio thread; when agent.step() holds this lock during slow LLM calls, the entire server becomes unresponsive
  6. False positive stuck detection — Issues #2482 ([Bug] Remote conversation got stuck: monologue detector false positive on extended thinking models) and #1349 ([BUG] Agent produces MessageEvent without ActionEvent, causing premature execution termination) show the stuck detector triggers incorrectly on extended thinking models

Conclusion: The PR's stuck detection changes (continue → break, event count tracking) are functionally equivalent to the original code and don't address the real problems. The most likely root cause is Finding 5 — the server literally cannot process any requests while the FIFO lock is held during agent.step(). No code changes were made — this was an investigation-only task as requested.

@all-hands-bot
Collaborator

[Automatic Post]: I have assigned @simonrosenberg as a reviewer based on git blame information. Thanks in advance for the help!

@DoubleDensity
Author

Just adding that this fix has held strong under continuous use for 48 hours, have not been able to reproduce the original issue anymore.

@xingyaoww xingyaoww requested review from csmith49 and enyst March 20, 2026 16:57
Collaborator

@csmith49 csmith49 left a comment

Looks good to me, modulo the event count tracking. I can't see a situation where it actually modifies the control-flow, and worry it will act as a red herring when resolving future issues. Will approve this if it's removed or if we can figure out that it's necessary for this fix.

@simonrosenberg
Collaborator

simonrosenberg commented Mar 20, 2026

Just adding that this fix has held strong under continuous use for 48 hours, have not been able to reproduce the original issue anymore.

Do you have a script that runs the agent a few hundred times so that we can reproduce the problem?
If so, can you validate that the event count tracking is indeed necessary to fix the issue, that the stuck part itself is not enough? Thanks!

Edit: if it's a burden to run that many runs with a personal key, I can always take over :) let me know!

@DoubleDensity
Author

DoubleDensity commented Mar 20, 2026

thanks for taking a look at this @simonrosenberg ! I don't have a script or an environment to run automated tests in, so I can only give anecdotal confirmation at the moment. I have been using OpenHands at work these past few weeks, but with limited success because this issue kept breaking my conversations. Since I put this patch in I have been using it non-stop with Nemotron 3 Super 120B on a DGX Spark w/ vLLM, and OpenHands has not locked up once; that was about three days ago now.


# Check for stuck patterns if enabled
# Only check if new events have been added since last check
# to avoid repeatedly triggering on the same pattern
Collaborator

Do you know by any chance what the pattern was? Or which of the stuck scenarios got triggered?

@enyst
Collaborator

enyst commented Mar 21, 2026

@DoubleDensity Thank you for looking into this! I think we fixed just today the cases where Qwen had tool calls in reasoning content, which caused it to get stuck and fail a lot. I don't know if the cause could be similar to what is causing this; just so you know, it was described in this comment and related investigations and PR.

I wonder if you could try running on an updated main?
PR:

The problem there was that the SDK interpreted the messages with empty content and empty tool calls as empty messages (since content and tool calls were 'hidden' in reasoning where we don't read them), and then 3 empty messages trigger the STUCK detector.

@DoubleDensity
Author

thanks @enyst I'll test it out!

@DoubleDensity
Author

hey @enyst ! I tried running #2512 for a couple of days to see if it was a better fix, and it looked promising. It went ~30 hours with no issues, but has now become stuck:

~/.openhands/conversations/129e7f493c6f499c881902714f3398e5$ cat base_state.json | jq '.execution_status'
"stuck"

I will make a backup of that conversation in case there is any useful metadata you want to extract from it

@DoubleDensity
Author

DoubleDensity commented Mar 26, 2026

it does seem to at least partially fix the problem; I was able to ask OpenHands to continue after it became stuck, and it resumed successfully. so #2512 is definitely an improvement, but it can still stop before reaching completion

@DoubleDensity
Author

with #2512 it resumed, but then stopped again after only 4 turns. not sure how to explain it, but not ideal

@enyst
Collaborator

enyst commented Mar 26, 2026

What was the reason for stopping again and did the LLM actually repeat itself?
